1 Learning Outcomes

1.1 References:

2 Set-up:

2.1 Check if you have Git installed. If not, install Git:
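
One way to check, as a minimal sketch: ask Git for its version from a terminal. If the command is not found, Git is either not installed or not on your PATH.

```shell
# Check whether Git is available on this machine.
if command -v git >/dev/null 2>&1; then
  git_version=$(git --version)        # e.g. "git version 2.x.y"
  echo "Git is installed: $git_version"
else
  git_version=""
  echo "Git is not installed; install it before continuing."
fi
```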

2.2 Create an account on GitHub: https://github.com/

  • Select a free plan. If asked for a billing address, use the same email address you used to create your account (no charges will be billed).
  • When you choose your username, “Choose Wisely” - you might use this for professional purposes.
  • Recommend something related to your name (e.g., my AU-related account username is “rressler”). I would recommend against something like “richard_awesome_granddad”.

2.3 Update your Git Configuration using Git

  • Git uses a global config file that is hidden in your home directory to track many configuration settings (on a Mac the home directory is typically ~).

  • To tell Git your name and email address, open a terminal window in R Studio and type:

    git config --global user.name "YOUR FULL NAME"
    git config --global user.email "YOUR EMAIL ADDRESS"
  • If you are worried about email privacy, follow GitHub’s instructions here.

  • Extra Setup Step for Windows Users: To use the Git Bash terminal in R Studio on Windows, you need to make sure it is set as the default.

    1. Go to Tools > Global Options… > Terminal.
    2. Make sure you have “New terminals open with: Git Bash”.
    3. Exit out of the old terminal. Restart R Studio.
    4. Then do Tools > Terminal > New Terminal.

2.4 (Optional) Set the Git text editor to your favorite editor instead of the default Vim. I like to use Atom.

  • <https://swcarpentry.github.io/git-novice/02-setup/> has the commands for different editors.

  • If you want to use Atom,

    • Go to <https://atom.io/> and install the version for your computer type.

    • Once you have installed the Atom editor, install an Atom package for the terminal

    • In the Welcome Guide click on install package and then open installer

    • Enter platformio-ide-terminal and click install

    • You can then open up a terminal window with the + sign at the bottom left

    • To make Atom the default text editor for Git,

      • Go to a terminal window (in R Studio or Atom) and enter git config --global core.editor "atom --wait"
      • This will update the hidden .gitconfig file in your home directory.

2.5 Check your settings

  • Enter git config --list. If the output fills more than one screen, press Enter to scroll down until you reach the end.
  • Enter q to quit reading and get your cursor back.
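
If you only want to confirm one or two values rather than page through the whole list, git config --get prints a single setting. The sketch below is illustrative only: it points HOME at a throwaway directory so it cannot touch your real configuration, and the name and email are placeholders.

```shell
# Use a scratch HOME so this demo cannot change your real ~/.gitconfig.
HOME=$(mktemp -d)
export HOME
unset XDG_CONFIG_HOME

git config --global user.name "Ada Example"        # placeholder name
git config --global user.email "ada@example.com"   # placeholder email

# Read the values back one at a time.
name=$(git config --global --get user.name)
email=$(git config --global --get user.email)
echo "user.name  = $name"
echo "user.email = $email"
```

On your own machine you would skip the HOME lines and run only the two --get commands.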

3 Why Bother using a Version Control System (VCS)

3.1 Motivation 1: Change code without the fear of breaking it

  • You want to try out something new, but you aren’t sure if it will work.

  • Non-git solution: Copy and rename the files over and over

    • analysis.R,
    • analysis2.R,
    • analysis3.R,
    • analysis_final.R,
    • analysis_final_final.R,
    • analysis_absolute_final.R,
    • analysis7.R
    • analysis8.R
  • Issues:

    • Difficult to remember differences among files.
    • Which files produced specific results?
    • Requires a lot of careful documentation and user bookkeeping (not likely to happen!)
  • Git lets you change files while automatically keeping track of old versions. It is easy to revert back to old versions if you decide the new changes don’t work.

3.2 Motivation 2: Easy Collaboration

  • In a group setting, your collaborators might (will) suggest how to change your analysis/code.

  • First non-git solution: Email files back/forth.

    • Issues:
      • You have to manually incorporate changes.
      • Only one person can work on the code at a time (otherwise multiple changes might be incompatible).
  • Second non-git solution: Share a Dropbox or Google Docs folder (a “centralized” version control system).

    • Issues:
      • Again, only one person can work on the code at a time.
      • Less user-friendly for tracking changes.
      • Difficult to run excursions
  • Git lets each individual work on their own local repository and offer their changes for review in a way that allows you to control which changes get approved to be incorporated into the baseline and then automatically incorporate those changes. Documentation of changes is built into the workflow (so it actually happens!).

3.3 Motivation 3: High demand skill for future employment

3.4 Motivation 4: The Foundation for GitHub

  • You can make your final-project repo public so prospective employers can view your work.

  • You can host a website on GitHub, increasing your visibility. Professor Gerard hosts his personal website and teaching websites on GitHub.

4 Git Overview

4.1 Common Git Commands

  • All Git commands begin with git followed immediately by an argument for the type of command you want to execute.

  • For the right-hand-side of the diagram, the following are the useful Git commands:

    • git init: Initialize a Git repository. Only do this once per project/repository.
    • git status: Show which files are staged in your working directory, and which are modified but not staged.
    • git add: Add modified files from your working directory to the stage.
    • git diff: Look at how files in the working directory have been modified.
    • git diff --staged: Look at how files in the stage have been modified.
    • git commit -m "[descriptive message]": commit your staged content as a new commit snapshot.
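
The commands above can be strung together as follows. This is a sketch run in a scratch directory; the file name notes.txt, the identity, and the commit message are made up for illustration.

```shell
set -e
cd "$(mktemp -d)"                      # scratch directory, safe to delete later

git init -q                            # once per project
git config user.name "Demo User"       # identity for this repo only (placeholder)
git config user.email "demo@example.com"

echo "first line" > notes.txt
git status --short                     # notes.txt is listed as untracked
git add notes.txt                      # working directory -> stage
git diff --staged --stat               # summary of what is staged
git commit -q -m "Add notes file"      # stage -> commit history

commit_count=$(git rev-list --count HEAD)
echo "commits so far: $commit_count"
```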

4.2 Repositories and Folder Structure (Housekeeping)

  • A repository (or repo, for short) is a collection of files (in a folder and its subfolders) being version controlled (configuration managed) as a set.

  • The repo also contains the local version control data, usually in a hidden folder and files.

  • In data science, each repository is typically one project (like an analysis, a model, a homework, or a collection of code that performs a similar task).

  • Recommend creating a folder somewhere on your computer called STAT_413 or STAT_613 with three sub-folders for Lectures, Homework, and Project - note no spaces in the names.

  • Suggest treating Lectures and Project as repositories and create lower level folders for each class period if you wish.

  • Suggest creating a subfolder under Homework for each week’s assignment and treating each subfolder as its own repository (which is usually how assignments will come, with their own set of folders).

  • These repositories will constitute your local repositories within which you will navigate to various working directories to manage your files and sync updates with Git and between Git and GitHub.
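
As a sketch, the suggested layout can be created from a terminal. The parent location (a scratch directory here) is an assumption; substitute wherever you keep course work, and use STAT_613 if that is your section.

```shell
base=$(mktemp -d)                 # stand-in for wherever you keep course files
course="$base/STAT_413"

# -p creates parent folders as needed and does not complain if they already exist.
mkdir -p "$course/Lectures" "$course/Homework" "$course/Project"

ls "$course"
```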

5 Git Basics

5.1 Intro

  • We’ll learn some Git as we examine a topic from the famous paper of Oeppen and Vaupel (2002).

  • Oeppen and Vaupel (2002) found perhaps the strongest association in social science: a linear relationship between year of birth and the maximum life expectancy where the maximum is taken over countries. We’ll examine this relationship for ourselves.

  • We’ll use the gapminder_unfiltered data frame from the gapminder library. The variables in this data frame are:

    • country: The name of the country.
    • continent: The continent of the country.
    • year: The year of the measurement. From 1952 to 2007.
    • lifeExp: The life expectancy at birth, in years.
    • pop: Population.
    • gdpPercap: GDP per capita (US$, inflation-adjusted).
  • Create a folder called life_exp, e.g., under STAT_X13/Lectures/Week01_git.

  • Create an R Markdown file called “life_exp_analysis.Rmd” within the life_exp folder. Your Rmd might look something like this:

  • Save “life_exp_analysis.Rmd”.

5.2 Initialize (Create) a local repository

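  • Open up a terminal and navigate so the working directory is the life_exp folder you just created (e.g., STAT_X13/Lectures/Week01_git/life_exp), since the commands below refer to life_exp_analysis.Rmd by name.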

  • Your R Studio should look something like:  

  • Use the command git init to create a repository.

  • Now type in the terminal

    git init
  • You’ve just created a Git repository! That means there is a .git hidden folder tracking all of the changes you make for the files you tell it about.

    ls -a
  • However, Git won’t track any files until you tell it to.

5.2.1 Status

  • Use git status to see what files Git is tracking and which are untracked.

    git status
  • The output should tell you that life_exp_analysis.Rmd is not tracked. In fact you should have no tracked files.
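
For a more compact view, git status --short prints one line per file, with ?? marking untracked files. A sketch in a scratch repository, where the empty life_exp_analysis.Rmd is a stand-in for your real file:

```shell
set -e
cd "$(mktemp -d)"
git init -q
: > life_exp_analysis.Rmd            # empty stand-in for the real file

status_line=$(git status --short)
echo "$status_line"                  # ?? life_exp_analysis.Rmd
```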

5.3 Stage Files

  • Use git add to add files to the stage.

    git add life_exp_analysis.Rmd
  • Always check which files have been added:

    git status
  • Useful flags for git add:

    • --all will stage all modified and untracked files.
    • --update will stage all modified files, but only if they are already being tracked.
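
The difference between the two flags is easiest to see side by side. The following sketch uses a scratch repository with hypothetical file names: one file is already tracked and then modified, the other is brand new.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"         # placeholder identity
git config user.email "demo@example.com"

echo "v1" > tracked.txt
git add tracked.txt
git commit -q -m "Track one file"

echo "v2" > tracked.txt                  # modify the tracked file
echo "new" > untracked.txt               # create a file Git has never seen

git add --update                         # stages tracked.txt only
after_update=$(git diff --staged --name-only)

git add --all                            # now also stages untracked.txt
after_all=$(git diff --staged --name-only | sort | tr '\n' ' ')

echo "after --update: $after_update"
echo "after --all:    $after_all"
```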

5.4 Commit Files

  • Use git commit to commit files that have been staged to create snapshots in the commit history.

    • The -m argument will allow/require you to make a comment about the commit.

      git commit -m "New life exp rmd file."
  • Your message (written after the -m argument) should be concise, and describe what has been changed since the last commit.

  • If you forget to add a message, Git will open up your default text-editor where you can write down a message, save the file, and exit. The commit will occur after you exit the text editor.

  • If your default text editor is Vim, press Escape and then type :wq to save the message and exit (typing :q! instead quits without saving, which aborts the commit). See this for more options.

  • git status should now be clear because there are no modified files:

    git status
  • You can see all of your commits using git log.

    git log
  • You have now completed the workflow on the right side of the Git diagram.
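
When the history grows, git log --oneline gives a one-line-per-commit summary. A minimal sketch in a scratch repository (the identity and file contents are placeholders):

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "draft" > life_exp_analysis.Rmd
git add life_exp_analysis.Rmd
git commit -q -m "New life exp rmd file."

remaining=$(git status --short)          # empty: nothing left to stage or commit
last_message=$(git log -1 --format=%s)   # subject line of the newest commit
git log --oneline                        # abbreviated hash + subject per commit
```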

5.4.1 Exercise:

  • As a next step in the analysis of the gapminder_unfiltered data frame,
  1. Add code into life_exp_analysis.Rmd that
    • Loads the gapminder_unfiltered data frame into R, and
    • Calculates the maximum life expectancy each year and the corresponding country that had that maximum life expectancy. Hint: There are multiple ways to do this, but the easiest involves group_by() and filter().
  2. Change the header from “Analysis” to “Life Expectancy Analysis”.
  3. Save life_exp_analysis.Rmd.

5.5 Look at changes

  • Use git diff to see changes in all modified files.

    git diff
  • Lines after a “+” are being added. Lines after a “-” are being removed.

  • When there are a lot of lines that fill your terminal window, you can exit git diff by hitting q.

  • Check the status of your modified files.

  • Stage your modified files, but don’t commit yet.

  • Recheck your status.

  • git diff won’t check for changes in the staged files by default. But you can see the differences using git diff --staged.

    git diff
    git diff --staged
  • Commit your changes. Use a nice commit message.
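
To see how staging moves a change from git diff to git diff --staged, here is a sketch in a scratch repository. It uses --name-only to keep the output short, and the file contents are stand-ins for your real analysis.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "# Analysis" > life_exp_analysis.Rmd
git add life_exp_analysis.Rmd
git commit -q -m "New life exp rmd file."

echo "# Life Expectancy Analysis" > life_exp_analysis.Rmd   # edit the header

unstaged_before=$(git diff --name-only)         # the edit shows up here
git add life_exp_analysis.Rmd
unstaged_after=$(git diff --name-only)          # now empty
staged_after=$(git diff --staged --name-only)   # the edit moved to the stage

echo "git diff before add:         $unstaged_before"
echo "git diff after add:          ${unstaged_after:-nothing}"
echo "git diff --staged after add: $staged_after"
```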

6 Using GitHub as a remote repository for version control

6.1 Create a Repository on GitHub

  • Create a repo on GitHub by selecting “New” on the homepage:

     

  • Or go to the “Repositories” tab and select “New”



  • Tell GitHub the name of your repo. In general, it can be a different name than the repo on your local machine.

    • For this class, name it “life_exp_USERNAME” where “USERNAME” is your GitHub username.
  • Make a small description.

  • To avoid errors, do not initialize the new repository with README, license, or gitignore files. You can add these files after your project has been pushed to GitHub.

  • Then click “Create Repository”.



6.2 Tell your local Git where GitHub will host your repository.

  • The URL is the location of the repo. It is generally of the form “https://github.com/GHUser/GitHubRepoName.git” where “GitHubRepoName” is whatever you chose to name the repo on GitHub.

    • You should have your own GHUser name
  • The command git remote tells Git to do something associated with a remote repository. In this case we want to add a new one. We need to tell Git the name and location of the added remote repository.

  • GitHub allows you to copy the URL for your new repository - you can use the Clone or Download button to copy it to your clipboard for pasting into your terminal. It also gives you suggestions for the commands to use.


  • Use git remote add to tell Git where we will host our repo.

    git remote add origin https://github.com/rressler/life_exp_rressler.git
  • In the above command, “origin” is just a nickname we gave to the location URL that is hosting our repo. We could have used “github” or “deep_space_nine” instead, but “origin” or “upstream” are traditional names.
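
You can confirm what a nickname points to before pushing. In this sketch, GHUser and GitHubRepoName are the placeholders from the URL pattern above; adding a remote only records the URL and does not contact GitHub.

```shell
set -e
cd "$(mktemp -d)"
git init -q

# Record the remote under the nickname "origin" (placeholder URL).
git remote add origin https://github.com/GHUser/GitHubRepoName.git

git remote -v                                        # lists the fetch and push URLs
origin_url=$(git config --get remote.origin.url)     # the stored URL for "origin"
echo "origin points to: $origin_url"
```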

6.3 Push files from your local repo Master to the remote GitHub Origin

  • Use git push to push commits to GitHub.

  • If you are pushing to a brand new repo on GitHub, you need to use:

    git push -u origin master
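  • The terminal may ask you for your GitHub username and password. GitHub no longer accepts your account password for Git operations over HTTPS, so enter a personal access token at the password prompt.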

  • When the push is complete, your code is now up on GitHub.  

  • Here we are pushing to the remote repository (origin) from our current local branch (master).

    • It can be confusing to remember which is origin and which is master - perhaps use the hint origin is for “Other computers” and master is for “My computer”
  • The -u is needed since this is a new repository. It tells Git to connect the behind-the-scenes tracking information between origin (GitHub) and master (git).

    • It is equivalent to the --set-upstream option where the “upstream” location is the GitHub origin and it is “upstream” from the local master.
  • For all subsequent pushes for the same repos, once the origin and master repos are connected, you can just type:

    git push
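
The effect of -u can be tried without a GitHub account. The sketch below substitutes a local bare repository for GitHub (an assumption made purely for illustration) and reads the branch name from Git, since your default branch may be called master or main depending on your setup.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"     # local stand-in for the GitHub repo

git init -q "$work/local"
cd "$work/local"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "draft" > life_exp_analysis.Rmd
git add life_exp_analysis.Rmd
git commit -q -m "New life exp rmd file."

git remote add origin "$work/remote.git"
branch=$(git symbolic-ref --short HEAD)   # current branch name
git push -q -u origin "$branch"           # first push: also records the upstream

upstream=$(git rev-parse --abbrev-ref "$branch@{upstream}")
echo "local $branch tracks $upstream"

echo "more" >> life_exp_analysis.Rmd
git commit -q -am "Extend draft"
git push -q                               # later pushes need no arguments
```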

6.3.1 Exercise

  • Add some comments before your code chunk in life_exp_analysis.Rmd describing what the code is doing.
  • Save the file, stage the file for commit, commit the changes, then push the changes to GitHub.
  • Everyone should now have a repository on GitHub with an updated life_exp_analysis.Rmd file in it.

7 Sharing GitHub Repositories with Others

7.1 GitHub Commands for Sharing Repositories

  • GitHub introduced the concept of a collaboration workflow with commands to support the managed sharing of repositories.
  • Use fork or download to create separate copies with no links to the original
  • Use git clone to copy a repository to your local machine and maintain a link to the original (origin) on GitHub.

7.1.1 Fork a Repository

  • The GitHub fork command creates a separate and distinct copy of a given GitHub repo on your GitHub account. Changes you make in the fork do not affect the original repo.
  • You can make all the changes you want to the files in a forked repo but the original owner will not see them directly. It is a separate repo on GitHub (it’s not on your local system yet).
  • You can request others to look at (pull) your code for review and possible merging into their original baseline.

7.1.1.1 Exercise Continued

  • Use fork to create a copy of someone else’s repo so you can extend their code to finalize the plot.
    • Use Chat to post your GitHub username and select a partner to work with.
    • Search for a person on GitHub, and go to their “life_exp_USERNAME” repository. The url should be something like “https://github.com/PARTNERNAME/life_exp_PARTNERNAME” where “PARTNERNAME” is your partner’s username.
    • At the top right, on the level of the repo name, click the “Fork” button.
  • You should now have a copy of your partner’s repo on your own GitHub account page.

7.1.2 Clone

  • Cloning uses Git to create a copy of a given GitHub repo on your machine and maintain a link to the original repo.
  • You can push updates to the original if and only if you have write privileges
    • If you do not have write privileges - say it is someone else’s repository, then Fork the repo first so you have it on your own account and then clone it. That way you can upload any changes to GitHub. You can also issue a pull request to the original owners if you want to share your code with them for review.
  • Go to the terminal window and create/navigate to the desired directory for a new repo. Then use the command git clone URL to create a local version of the repo you want from GitHub.
  • Look at help with git clone -h
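
Cloning can be rehearsed locally. In this sketch a small repository on disk stands in for the GitHub URL (an assumption for illustration); the clone ends up with the files, the hidden .git folder, and a remote named origin that links back to the source.

```shell
set -e
work=$(mktemp -d)

# Build a small repository to clone from (stand-in for a GitHub URL).
git init -q "$work/source"
cd "$work/source"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "draft" > life_exp_analysis.Rmd
git add life_exp_analysis.Rmd
git commit -q -m "New life exp rmd file."

# Clone it into a new folder, as you would with: git clone URL
cd "$work"
git clone -q "$work/source" copy
cd copy

clone_origin=$(git config --get remote.origin.url)   # the link back to the original
echo "origin: $clone_origin"
ls -a                                                # working files plus the hidden .git folder
```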

7.1.2.1 Exercise Continued

  • Create/Navigate to a new directory on your local machine where you would like to place your partner’s Git repo (outside of your own local Git repo).

  • Clone the forked version of your partner’s repo to your local machine:

    • You can copy the URL from GitHub or type it in:

      git clone https://github.com/USERNAME/life_exp_PARTNERNAME.git
  • You should be able to see the files from the new repo in your directory:

    ls
    ls -a
  • We will do this a lot throughout the course for assignments etc.

7.1.3 Workflow for Updates: Edit, add, commit, push

  • Enter the new repo on your machine - the cloned version of the fork of your partner’s repo.

7.1.3.1 Exercise Continued

  1. Edit your partner’s file to add code to create a scatterplot of year versus maximum life expectancy. Color code by country, and add a single Ordinary Least Squares (OLS) line.
  • Your plot should look like this:


  2. Save the modified file, then stage it, and commit the changes with a comment.
  3. Push the changes to your forked repo on GitHub.
  • You should now see the changes in your repo on GitHub

8 Basic Collaboration with GitHub - Pull Requests, Merging, and Updating

8.1 Submit a pull request

  • A “pull request” is a request in GitHub from you to your partner to “pull” (create a virtual copy of) your code and review it in case they want to incorporate the changes you made into their baseline file in their repository.

  • Navigate to your forked version of your partner’s repo up on GitHub. There are two ways to generate a new request.

    1. In the Code tab, click on “New pull request”, or,

     

    2. Click on the Pull Requests tab and click on “New pull request”

       

  • Write an informative title and a message describing what your code does (how it fixes an issue or adds new functionality, and why they should use it), then click “Create pull request”.

     

8.2 Accept a pull request - Merging Code from another branch or source into your Baseline

  • A pull request is a request to review someone’s code for possible inclusion into the baseline.

  • If you accept the request once you have reviewed the code, you can initiate a Merge which will update the baseline to include the changes in the submitted code.

  • Use the “Pull requests” tab on your dashboard or in your repository to view all the pull requests folks have submitted to you.

     

8.2.0.1 Exercise Continued

  • Navigate to the pull request your partner sent you. Then you can see the changes they made under the “Files changed” tab.

  • You can write comments, for example asking them to change the code before you accept the pull request.

  • Or, you can just accept the pull request by hitting “Merge pull request” and then “Confirm Merge”.


  • This was a case where there should not have been any merge conflicts. If, however, two people change the same line of code, that will create a merge conflict that GitHub does not know how to resolve. You will have to decide collectively what should go into the baseline and each update your files to eliminate the conflict.

  • Once the merge is complete, the screen will update to show the pull request is closed. It will also offer the opportunity to delete the forked repo.

  • If you are done with it, that is a good practice from a housekeeping perspective. However, if you have multiple issues you are working, do not delete until after all issues are closed.


8.3 Update your local machine with Changes to the GitHub baseline using Git.

  • Now that you have updated the baseline, there may be other changes on the GitHub baseline you want to incorporate back into your copy of the files.

  • Use git pull whenever there are modifications on GitHub and you want to bring your local repo up-to-date.

  • Go back to your original directory life_exp. Check the status. It has no idea your baseline file on GitHub has been updated with the graph code. You want to keep your GitHub remote repo and your local repo in sync.

  • To update your local files with the changes from the remote (GitHub), make sure you are back in the correct working directory. The git pull command does two things:

    1. It executes a fetch command to copy the updated code to your machine and then,
    2. It automatically executes a merge command to update your local files.
  • If you changed your local files while you also changed your remote files, you could have created a merge conflict you will need to resolve. Git will tell you about it.

    git pull
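
To make the two steps visible, the sketch below runs git fetch and git merge separately instead of git pull. A local bare repository stands in for GitHub, and the "yours" and "partner" folders are hypothetical clones; all names and identities are assumptions for illustration.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"            # local stand-in for GitHub

# "yours": create the first commit and push it.
git init -q "$work/yours"
cd "$work/yours"
git config user.name "You"
git config user.email "you@example.com"
echo "draft" > life_exp_analysis.Rmd
git add life_exp_analysis.Rmd
git commit -q -m "New life exp rmd file."
git remote add origin "$work/remote.git"
branch=$(git symbolic-ref --short HEAD)
git push -q -u origin "$branch"

# "partner": clone, add a change, and push it to the shared remote.
git clone -q "$work/remote.git" "$work/partner"
cd "$work/partner"
git config user.name "Partner"
git config user.email "partner@example.com"
echo "plot code" >> life_exp_analysis.Rmd
git commit -q -am "Add plot code"
git push -q origin HEAD

# Back in "yours": fetch downloads the new commit, merge applies it.
cd "$work/yours"
git fetch -q origin
behind=$(git rev-list --count "HEAD..origin/$branch")   # commits you do not have yet
git merge -q --ff-only "origin/$branch"
last_line=$(tail -n 1 life_exp_analysis.Rmd)
echo "was behind by $behind commit(s); file now ends with: $last_line"
```

In everyday use a single git pull does the same thing; splitting it up is mainly useful when you want to inspect incoming changes before merging them.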

8.4 Using Issues to collaborate

  • GitHub allows you to generate “Issues” to identify questions or tasks to be resolved for a piece of code.
  • If you are part of a Team, issues can be assigned (even to yourself) or unassigned to keep track of the work flow across team members and reduce the potential for merge conflicts.
  • Issues should focus on discrete areas: one bug to be fixed or one feature to be added, not a whole list of problems in different files or topics.


9 Best Practices

9.1 Parting Wisdom from XKCD

 

References

Oeppen, Jim, and James W. Vaupel. 2002. “Broken Limits to Life Expectancy.” Science 296 (5570): 1029–31. https://doi.org/10.1126/science.1069675.


  1. Graphic from Mark Lodato.